Add Support for 2/3/8-bit GPTQ Quantization Models #2330
Conversation
Hey @chu-tianxiang, what's the request rate / QPS for your throughput test? Any intuition on why we've seen ~2x tokens per second but lower throughput?
I used benchmark_throughput.py, which adds all requests before running the inference instead of sending them at some request rate.
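For reference, the offline-batch style that script uses can be approximated with vLLM's Python API roughly as below; the model name, prompt count, and sampling settings here are placeholders, not the benchmark's actual configuration.

```python
from vllm import LLM, SamplingParams

# Placeholder GPTQ checkpoint; any model supported by this PR would do.
llm = LLM(model="TheBloke/Llama-2-7B-GPTQ", quantization="gptq")

# All requests are handed to the engine in one batch, so there is no
# per-request arrival rate: throughput is total generated tokens
# divided by the wall-clock time of this single generate() call.
prompts = [f"Prompt {i}: write a short story." for i in range(256)]
params = SamplingParams(temperature=0.8, max_tokens=128)
outputs = llm.generate(prompts, params)
```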
Hey @chu-tianxiang, can you please update this PR to the latest master branch?
Looking forward to this getting merged!
@chu-tianxiang I have tested this feature using the model https://huggingface.co/TheBloke/WizardCoder-33B-V1.1-GPTQ/tree/gptq-8bit--1g-actorder_True. It's OK when setting …
LGTM! Awesome! Thanks for the PR and apologies for the delayed review.
GPTQ 8-bit doesn't work on V100; it cannot compile since it requires sm80 and above (the Marlin and QuIP kernels). Any plan to fix that, given that V100 is on the official supported list?
@esmeetu I tested the model in the link and cannot reproduce the illegal memory access error. Could you please provide more details about the setup and code? @aliencaocao this PR doesn't include Marlin or QuIP kernels; I guess you're talking about the gptq hf branch?
Yes, I meant the gptq hf branch. I figured it out myself by removing all the QuIP and Marlin code, and it works for me.
There's already a pull request supporting varying quantization bit levels for GPTQ models, leveraging kernels from the AutoGPTQ repository. This PR presents an alternative approach inspired by exllamav2.
While exllamav2 doesn't natively support 2/3/8-bit GPTQ models, it has the essential components. In essence, EXL2 operates as a mixed-bit GPTQ format, so uniform 2/3/8-bit models can be seen as special cases. Although there are minor differences in scales and zero points, these can be easily adjusted.
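To make the "special case" point concrete, here is a rough PyTorch-level sketch of dequantizing a uniform-bit GPTQ layer. It is not code from this PR (the real work lives in the kernels), and the tensor names and the -1 zero-point packing offset are assumptions based on the common AutoGPTQ checkpoint layout. The per-group scale/zero formula is identical for every bit width, which is why uniform-bit layers can slot into a mixed-bit (EXL2-style) kernel as special cases.

```python
import torch

def dequantize_gptq(qweight, qzeros, scales, g_idx, bits):
    """Sketch for bit widths that divide 32 (2, 4, 8); 3-bit values span
    int32 boundaries and need extra unpacking logic that is omitted here.

    qweight: [in_features * bits // 32, out_features] int32, packed along input dim
    qzeros:  [num_groups, out_features * bits // 32]  int32, packed along output dim
    scales:  [num_groups, out_features]               float16
    g_idx:   [in_features] group index of each input channel
    """
    pack = 32 // bits
    mask = (1 << bits) - 1
    shifts = torch.arange(pack, dtype=torch.int32) * bits

    # Unpack weights along the input dimension.
    w = (qweight.unsqueeze(1) >> shifts.view(1, -1, 1)) & mask   # [rows, pack, out]
    w = w.reshape(-1, qweight.shape[1])                          # [in_features, out]

    # Unpack zero points along the output dimension; AutoGPTQ-style checkpoints
    # store zeros minus one, hence the +1 (assumed convention).
    z = (qzeros.unsqueeze(2) >> shifts.view(1, 1, -1)) & mask    # [groups, out/pack, pack]
    z = z.reshape(qzeros.shape[0], -1) + 1                       # [groups, out]

    # Same per-group formula regardless of bit width.
    g = g_idx.long()
    return (w.float() - z[g].float()) * scales[g].float()
```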
Following is a comparison of the latency and throughput of Llama2-7B under different quantization bit widths on a single A100. The 4-bit numbers are from the main branch with the CUDA-graph fix, while the 3-bit and 8-bit numbers are newly added. All were measured using the benchmark_latency.py and benchmark_throughput.py scripts. (2-bit GPTQ models can hardly generate coherent output and are of no practical value, so I didn't include them below.) This has not been tested on ROCm devices yet.
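As a usage note, not taken from the PR itself: loading one of the newly supported bit widths through vLLM's Python API should look the same as the 4-bit case, since the bit width comes from the checkpoint's quantization config rather than a flag. The model and revision below are simply the 8-bit branch mentioned earlier in this thread.

```python
from vllm import LLM, SamplingParams

# 8-bit GPTQ branch referenced above; the bit width is read from the
# checkpoint's quantize_config.json, not passed explicitly.
llm = LLM(
    model="TheBloke/WizardCoder-33B-V1.1-GPTQ",
    revision="gptq-8bit--1g-actorder_True",
    quantization="gptq",
)

out = llm.generate(["def fibonacci(n):"], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```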